Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
PAC-Bayes bounds are among the tightest generalization bounds for
classifiers learned from independently and identically distributed (IID) data,
particularly for margin classifiers: recent contributions have shown how
practical these bounds can be, either for performing model selection
(Ambroladze et al., 2007) or even for directly guiding the learning of
linear classifiers (Germain et al., 2009). However, in many practical
situations the training data exhibit dependencies, and the traditional IID
assumption does not hold. Stating generalization bounds for such frameworks is
therefore of the utmost interest, from both theoretical and practical
standpoints. In this work, we propose the first, to the best of our knowledge,
PAC-Bayes generalization bounds for classifiers trained on data exhibiting
interdependencies. The approach we take to establish our results is based on
decomposing a so-called dependency graph, which encodes the dependencies within
the data, into sets of independent data points, thanks to graph fractional
covers. Our bounds are very general, since an upper bound on the fractional
chromatic number of the dependency graph is all that is needed to obtain new
PAC-Bayes bounds for specific settings. We show how our results can be used to
derive bounds for ranking statistics (such as the AUC) and for classifiers
trained on data distributed according to a stationary β-mixing process. Along
the way, we show how our approach seamlessly allows us to handle U-processes.
As a side note, we also provide a PAC-Bayes generalization bound
for classifiers learned on data from stationary φ-mixing distributions.
Comment: Long version of the AISTATS 09 paper:
http://jmlr.csail.mit.edu/proceedings/papers/v5/ralaivola09a/ralaivola09a.pd
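The core decomposition idea can be illustrated with a small, hypothetical sketch (not the paper's algorithm): a proper coloring of the dependency graph upper-bounds its fractional chromatic number, and each color class is a set of mutually independent points to which an IID PAC-Bayes bound could be applied. Here the dependent data are the pairs of a U-statistic over four examples; two pairs are linked whenever they share an example.

```python
# Hypothetical sketch: split dependent data into independent subsets via a
# proper coloring of the dependency graph. The number of colors used upper-
# bounds the fractional chromatic number chi_f of that graph.
from collections import defaultdict

def greedy_coloring(n, edges):
    """Greedy proper coloring of an n-vertex graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in range(n):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Pairs (i, j) over 4 examples; two pairs are dependent iff they share one.
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
edges = [(a, b) for a in range(len(pairs)) for b in range(a + 1, len(pairs))
         if set(pairs[a]) & set(pairs[b])]
coloring = greedy_coloring(len(pairs), edges)
classes = defaultdict(list)
for idx, c in coloring.items():
    classes[c].append(pairs[idx])
# Each color class now contains mutually independent pairs; an IID
# PAC-Bayes bound can be applied within each class.
```

On this toy instance the greedy coloring recovers the three perfect matchings of the four points, i.e., three independent sets, matching the chromatic number of the dependency graph.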
Learning the optimal scale for GWAS through hierarchical SNP aggregation
Motivation: Genome-Wide Association Studies (GWAS) seek to identify causal
genomic variants associated with rare human diseases. The classical statistical
approach for detecting these variants is based on univariate hypothesis
testing, with healthy individuals being tested against affected individuals at
each locus. Given that an individual's genotype is characterized by up to one
million SNPs, this approach lacks precision, since it may yield a large number
of false positives that can lead to erroneous conclusions about genetic
associations with the disease. One way to improve the detection of true genetic
associations is to reduce the number of hypotheses to be tested by grouping
SNPs. Results: We propose a dimension-reduction approach which can be applied
in the context of GWAS by making use of the haplotype structure of the human
genome. We compare our method with standard univariate and multivariate
approaches on both synthetic and real GWAS data, and we show that reducing the
dimension of the predictor matrix by aggregating SNPs yields greater precision
in detecting associations between the phenotype and genomic regions.
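A minimal, hypothetical illustration of the dimension-reduction step (not the authors' exact procedure, which exploits haplotype structure): adjacent, correlated SNP columns of the genotype matrix are aggregated into block-level predictors, so that far fewer hypotheses need to be tested. The `aggregate_snps` helper and fixed block size below are illustrative assumptions.

```python
# Hypothetical sketch: aggregate contiguous SNP columns of a genotype
# matrix into block-level predictors (here, simple fixed-size blocks;
# the paper uses hierarchical, haplotype-aware grouping).
import numpy as np

def aggregate_snps(X, block_size):
    """Average contiguous SNP columns in blocks of `block_size`."""
    n, p = X.shape
    blocks = [X[:, i:i + block_size].mean(axis=1)
              for i in range(0, p, block_size)]
    return np.column_stack(blocks)

rng = np.random.default_rng(0)
# 100 individuals, 12 SNPs coded 0/1/2 (minor-allele counts).
X = rng.integers(0, 3, size=(100, 12)).astype(float)
Z = aggregate_snps(X, block_size=4)  # 3 aggregated genomic regions
```

Association tests would then be run on the 3 aggregated columns of `Z` instead of the 12 columns of `X`.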
Composite kernel learning
The Support Vector Machine (SVM) is an acknowledged powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel from an ensemble of basis kernels, whose combination is optimized during learning. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data, where groups correspond to channels.
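The group-structured combination at the heart of the abstract can be sketched as follows. This is an assumed setup, not the paper's algorithm: Gram matrices are grouped by channel and combined with per-group and per-kernel weights (which Composite Kernel Learning would optimize jointly with the SVM; here they are fixed for illustration).

```python
# Minimal sketch (assumed setup): a composite kernel built as a weighted
# sum over groups of basis kernels, K = sum_g w_g * sum_k v_{g,k} * K_{g,k}.
import numpy as np

def rbf_gram(X, gamma):
    """Gaussian (RBF) Gram matrix on the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def composite_kernel(grams_by_group, group_w, kernel_w):
    """Combine grouped Gram matrices with group and within-group weights."""
    K = np.zeros_like(next(iter(grams_by_group.values()))[0])
    for g, grams in grams_by_group.items():
        K += group_w[g] * sum(w * G for w, G in zip(kernel_w[g], grams))
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
# Two "channels", each contributing one or more basis kernels.
grams = {"ch1": [rbf_gram(X, 0.5), rbf_gram(X, 2.0)],
         "ch2": [rbf_gram(X, 1.0)]}
K = composite_kernel(grams, {"ch1": 0.7, "ch2": 0.3},
                     {"ch1": [0.5, 0.5], "ch2": [1.0]})
```

Since the weights are non-negative, `K` is a valid (positive semi-definite) kernel and can be passed directly to any kernel machine; a sparsity-inducing penalty on the group weights would suppress uninformative channels.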
Semi-supervised output kernel regression for link prediction
We address the problem of link prediction as the task of learning an output kernel via semi-supervised output kernel regression. Working within the theory of operator-valued reproducing kernel Hilbert spaces for vector-valued functions, we establish a new representer theorem dedicated to semi-supervised regression with a penalized least-squares criterion. We then choose an operator-valued kernel defined from a scalar-valued input kernel and construct a Hilbert space with this kernel as its reproducing kernel, to which we apply the representer theorem. Minimizing the penalized least-squares criterion in this setting leads to a closed-form solution, as in ridge regression, which is thereby extended. We study the relevance of this new semi-supervised approach for transductive link prediction. Artificial datasets support our study, and two real-world applications are then treated using a very low percentage of labeled data.
Protein-protein interaction network inference with semi-supervised Output Kernel Regression
In this work, we address the problem of protein-protein interaction network inference as a semi-supervised output kernel learning problem. Using the kernel trick in the output space allows one to reduce the problem of learning from pairs to learning a single-variable function with values in a Hilbert space. We turn to the theory of Reproducing Kernel Hilbert Spaces devoted to vector-valued functions, which provides us with a general framework for output kernel regression. In this framework, we propose a novel method that extends Output Kernel Regression to semi-supervised learning. We study the relevance of this approach on transductive link prediction using artificial data and a protein-protein interaction network of S. cerevisiae, using a very low percentage of labeled data.
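The ridge-like closed form mentioned in these two abstracts can be sketched under simplifying assumptions: with a decomposable operator-valued kernel K_X(x, x')·Id and the output kernel trick, the coefficients solve a linear system in the input Gram matrix. The `okr_fit` name and the toy Gram matrices below are illustrative, not the authors' API.

```python
# Hypothetical sketch: output kernel regression with a ridge-like
# closed form, assuming a decomposable operator-valued kernel
# K_X(x, x') * Identity. The coefficient matrix C satisfies
# (Kx + lam * I) C = Ky, by analogy with kernel ridge regression.
import numpy as np

def okr_fit(Kx, Ky, lam):
    """Closed-form coefficients: C = (Kx + lam * I)^{-1} Ky."""
    n = Kx.shape[0]
    return np.linalg.solve(Kx + lam * np.eye(n), Ky)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
Kx = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # input Gram matrix
Ky = Kx.copy()  # toy output Gram matrix (known on labeled pairs)
C = okr_fit(Kx, Ky, lam=0.1)
```

Predicted output-kernel values for a pair would then be read off as `Kx[i] @ C @ Kx[:, j]`-style expansions; the semi-supervised extension of the papers modifies the penalty so that unlabeled nodes also constrain `C`.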
Mutations in the Polycomb Group Gene polyhomeotic Lead to Epithelial Instability in both the Ovary and Wing Imaginal Disc in Drosophila
Most human cancers originate from epithelial tissues, and cell polarity and adhesion defects can lead to metastasis. The Polycomb-Group of chromatin factors were first characterized in Drosophila as repressors of homeotic genes during development, while studies in mammals indicate a conserved role in body plan organization, as well as an implication in other processes such as stem cell maintenance, cell proliferation, and tumorigenesis. We have analyzed the function of the Drosophila Polycomb-Group gene polyhomeotic in epithelial cells of two different organs, the ovary and the wing imaginal disc. Clonal analysis of loss and gain of function of polyhomeotic resulted in segregation between mutant and wild-type cells in both the follicular and wing imaginal disc epithelia, without excessive cell proliferation. Both basal and apical expulsion of mutant cells was observed, the former characterized by specific reorganization of cell adhesion and polarity proteins, the latter by complete cytoplasmic diffusion of these proteins. Among several candidate target genes tested, only the homeotic gene Abdominal-B was a target of PH in both ovarian and wing disc cells. Although overexpression of Abdominal-B was sufficient to cause cell segregation in the wing disc, epistatic analysis indicated that the presence of Abdominal-B is not necessary for expulsion of polyhomeotic mutant epithelial cells, suggesting that additional polyhomeotic targets are implicated in this phenomenon. Our results indicate that polyhomeotic mutations have a direct effect on epithelial integrity that can be uncoupled from overproliferation. We show that cells in an epithelium expressing different levels of polyhomeotic sort out, indicating differential adhesive properties between the cell populations. Interestingly, we found distinct modalities between apical and basal expulsion of ph mutant cells, and further studies of this phenomenon should allow parallels to be made with the modified adhesive and polarity properties of different types of epithelial tumors.
Kernel pyramids
Statistical learning aims not only at prediction but also at analyzing or interpreting a phenomenon. We propose to guide the learning process by incorporating knowledge about how the similarities between examples are organized. This knowledge is represented by a "kernel pyramid", a tree structure that organizes distinct groups and subgroups of similarities. Under the assumption that few (groups of) similarities are relevant for discriminating between observations, our approach lets the relevant groups and subgroups of similarities emerge. We propose here the first complete solution to this problem, allowing a support vector machine (SVM) to be learned on kernel pyramids of arbitrary height. The weights of the (groups of) similarities are learned jointly with the SVM parameters by optimizing a criterion that we show to be a variational formulation of a problem regularized by a mixed norm. We illustrate our approach on a facial expression recognition problem, where image features are described by a pyramid representing the spatial organization and scale of wavelet filters applied to image patches.
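The pyramid structure itself can be sketched with a small, hypothetical data layout (illustrative names, not the paper's implementation): a tree whose leaves carry Gram matrices and whose nodes carry weights, the effective kernel being the weighted sum accumulated along root-to-leaf paths. In the paper these weights are learned jointly with the SVM; here they are fixed.

```python
# Illustrative sketch: a "kernel pyramid" as a weighted tree of Gram
# matrices; the effective kernel is the recursively weighted sum.
import numpy as np

def pyramid_kernel(node):
    """node: {"w": weight, "gram": K} at a leaf, or {"w": weight, "children": [...]}."""
    if "gram" in node:
        return node["w"] * node["gram"]
    return node["w"] * sum(pyramid_kernel(c) for c in node["children"])

n = 5
K1, K2, K3 = np.eye(n), 2 * np.eye(n), 3 * np.eye(n)  # toy basis Gram matrices
tree = {"w": 1.0, "children": [
    {"w": 0.5, "children": [{"w": 1.0, "gram": K1},
                            {"w": 1.0, "gram": K2}]},
    {"w": 0.5, "children": [{"w": 1.0, "gram": K3}]},
]}
K = pyramid_kernel(tree)
```

Driving a whole subtree's weight to zero removes every similarity group beneath it, which is how a sparsity-inducing mixed norm on the weights makes entire branches of the pyramid drop out.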
Pénalités hiérarchiques pour l'intégration de connaissances dans les modèles statistiques
Supervised learning aims not only at prediction but also at analyzing or interpreting an observed phenomenon. Hierarchical penalization is a generic framework for integrating prior information into the fitting of statistical models. This prior information represents the relations shared by the characteristics of a given problem. In this thesis, the characteristics are organized in a two-level tree structure, which defines distinct groups. The assumption is that few (groups of) characteristics are involved in discriminating between observations. Thus, for a learning problem, the goal is to identify the relevant groups of characteristics and, at the same time, the significant characteristics within these groups. An adaptive penalization formulation is used to extract the significant components of each level. We show that the solution of this problem is equivalent to minimizing a problem regularized by a mixed norm. These two approaches offer complementary viewpoints for studying the convexity and sparsity properties of the method, which is derived in both parametric and non-parametric function spaces. Experiments on brain-computer interface problems support our approach.
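A minimal sketch of the mixed-norm mechanism the thesis relies on (an illustrative stand-in, not the thesis's adaptive algorithm): the group-lasso-type penalty sum_g ||w_g||_2 admits a closed-form proximal step, block soft-thresholding, which zeroes out entire groups of characteristics while shrinking the rest.

```python
# Hypothetical illustration: proximal operator of t * sum_g ||w_g||_2
# (block soft-thresholding), the building block of mixed-norm
# regularization that discards whole groups of characteristics.
import numpy as np

def group_prox(w, groups, t):
    """Return prox_{t * sum_g ||.||_2}(w) for disjoint index groups."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        # Groups with norm <= t are zeroed entirely; others shrink.
        out[g] = 0.0 if norm <= t else (1 - t / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.1])
groups = [[0, 1], [2, 3]]
w_new = group_prox(w, groups, t=1.0)
# First group (norm 5) is shrunk; second group (norm ~0.14) vanishes.
```

Iterating this step inside a proximal-gradient loop on the squared loss yields a solver for the regularized problem; a second, within-group threshold would give the two-level selection described above.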